Monetary Cost Optimizations for HPC Applications on Amazon Clouds: Checkpoints and Replicated Execution

Authors

  • Yifan Gong
  • Amelie Chi Zhou
  • Bingsheng He
Abstract

I. MOTIVATION

Recently, many emerging high performance computing (HPC) and scientific computing applications have been developed and hosted in the cloud. Because these applications are usually long-running jobs that are costly in the cloud, monetary cost [11], [7] and performance [3], [2] are important optimization factors. Message Passing Interface (MPI) is the key programming paradigm for developing HPC and scientific applications. This motivates us to investigate whether and how we can reduce the monetary cost of MPI-based applications under a performance constraint in the cloud.

The cloud has evolved into an economic market. Besides on-demand instances, which charge users at a fixed rate, Amazon EC2 provides spot instances, whose prices are mainly determined by supply and demand in the market. Table I shows statistics of the price history of four types of spot and on-demand instances on Amazon in the US East region during August 2013. We make the following observations:

a) Spot instances are usually much cheaper than on-demand instances. There are some "outlier" points where the maximum spot price is much higher than the on-demand price. If spot instances are leveraged properly, they can reduce monetary cost [10] in comparison with on-demand-only solutions.

b) Different instance types show different degrees of price variation.

These observations are consistent with previous studies [6]. Leveraging spot instances is an ideal approach to reducing the monetary cost of MPI executions. However, a spot instance can be terminated whenever the spot price rises above the bidding price (i.e., an out-of-bid event). We have observed that the spot price is highly dynamic in both the spatial and temporal dimensions. Spatially, different cloud locations (e.g., different Amazon EC2 zones) have very different spot prices. Temporally, spot prices can be fairly stable during some periods and change dramatically during others.
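The out-of-bid rule described above can be illustrated with a minimal Python sketch. It assumes a spot instance runs while the spot price is at or below the bid and is terminated at the first sample where the price exceeds the bid. The function name `out_of_bid_events` and the price trace are hypothetical illustrations, not data from Table I or code from the paper.

```python
from typing import List


def out_of_bid_events(prices: List[float], bid: float) -> List[int]:
    """Return the sample indices at which a running spot instance is terminated.

    Assumption (for illustration only): the instance runs whenever
    price <= bid and is terminated at the first sample where price > bid.
    """
    events = []
    running = False
    for i, price in enumerate(prices):
        if price <= bid:
            # The bid covers the current price, so the instance is (re)started.
            running = True
        elif running:
            # Price rose above the bid while running: an out-of-bid event.
            events.append(i)
            running = False
    return events


# Hypothetical hourly price trace in $/hour (not taken from the paper).
trace = [0.03, 0.03, 0.50, 0.04, 0.03, 0.60, 0.70, 0.03]
events = out_of_bid_events(trace, bid=0.05)
print(events)
```

With this trace and a bid of $0.05/hour, the instance is terminated at samples 2 and 5. Sample 6 is not counted because the instance was already stopped at that point.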
Because of these spot price dynamics, failures can occur during MPI executions. To satisfy the performance requirement (usually expressed as a deadline), fault-tolerant execution is necessary. In this paper, we investigate two common fault-tolerance mechanisms for MPI: checkpointing and replicated execution. The two mechanisms are complementary. Checkpointing can reduce the execution time when a failure occurs, and replicated execution can reduce the failure rate in the spot market. When the spot price is stable, checkpointing is not necessary. When the spot price varies sharply, checkpointing becomes more useful.

TABLE I. STATISTICS ON SPOT PRICES ($/HOUR, AUGUST 2013, US EAST REGION) AND ON-DEMAND PRICES OF AMAZON EC2.
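The complementary roles of the two mechanisms can be sketched with a small back-of-the-envelope model in Python. This is not the paper's cost model. It assumes checkpoints are taken at fixed intervals with zero overhead, and that replicas fail independently with the same probability, which is a simplification of replicated execution in the spot market. All function names and numbers are hypothetical.

```python
import math
from typing import Optional


def work_lost_on_failure(elapsed_hours: float,
                         checkpoint_interval: Optional[float]) -> float:
    """Hours of computation that must be redone after a failure.

    Assumption: checkpoints are taken every `checkpoint_interval` hours at
    zero cost; with no checkpointing (None) the job restarts from scratch.
    """
    if checkpoint_interval is None:
        return elapsed_hours
    last_checkpoint = math.floor(elapsed_hours / checkpoint_interval) * checkpoint_interval
    return elapsed_hours - last_checkpoint


def all_replicas_fail_probability(p_fail: float, replicas: int) -> float:
    """Probability that every replica fails, assuming independent failures."""
    return p_fail ** replicas


# A failure 7.5 hours into the run (hypothetical numbers).
print(work_lost_on_failure(7.5, None))        # no checkpointing
print(work_lost_on_failure(7.5, 2.0))         # checkpoint every 2 hours
print(all_replicas_fail_probability(0.5, 2))  # two independent replicas
```

Under these assumptions, checkpointing every 2 hours cuts the work lost from 7.5 hours to 1.5 hours, while running two replicas lowers the chance that the whole execution fails from 0.5 to 0.25. Checkpointing limits the damage of a failure and replication lowers the chance of one, which is the sense in which the text calls them complementary.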


Similar resources

Scalable Resource Management in Cloud Computing

The exponential growth of data and application complexity has brought new challenges to the distributed computing field. Scientific applications are growing more diverse, with workloads ranging from traditional MPI high performance computing (HPC) to fine-grained, loosely coupled many-task computing (MTC). Traditionally, these workloads have been shown to run well on supercomputers and high...


CAP: A Cloud Auto-Provisioning Framework for Parallel Processing Using On-demand and Spot Instances

Cloud computing has drawn increasing attention from the scientific computing community due to its ease of use, elasticity, and relatively low cost. Because a high-performance computing (HPC) application is usually resource demanding, it can incur a high monetary expense even in the cloud without careful planning. We design a tool called CAP (Cloud Auto-Provisioning framework for Parallel Processing...


Scheduling Multilevel Deadline-Constrained Scientific Workflows on Clouds Based on Cost Optimization

This paper presents a cost optimization model for scheduling scientific workflows on IaaS clouds such as Amazon EC2 or RackSpace. We assume multiple IaaS clouds with heterogeneous virtual machine instances, with limited number of instances per cloud and hourly billing. Input and output data are stored on a cloud object store such as Amazon S3. Applications are scientific workflows modeled as DAGs as ...


Redundant Execution of HPC Applications with MR-MPI

This paper presents a modular-redundant Message Passing Interface (MPI) solution, MR-MPI, for transparently executing high-performance computing (HPC) applications in a redundant fashion. The presented work addresses the deficiencies of recovery-oriented HPC, i.e., checkpoint/restart to/from a parallel file system, at extreme scale by adding the redundancy approach to the HPC resilience portfol...


A VMD Plugin for NAMD Simulations on Amazon EC2

VMD and NAMD are two major molecular dynamics simulation software packages, which can work together for mining structural information of bio-molecules. Carrying out such molecular dynamics simulations can help researchers to understand the roles and functions of various bio-molecules in life science research. Recently, clouds have provided HPC clusters on demand that allow users to benefit from...



Publication date: 2014